CD8 TIL projection uncover cells types hidden by transient gene programs

Cell-Cycle, interferon and Tissue-residency gene programs

Background

Cells responding to external cues might overexpress transient gene programs. In scRNA-seq data, clusters will be driven by the strongest variation component, which can be sometimes transient gene programs. These programs such as the cell cycle, might hide the original, potentially more stable cell types.

These are among the main types of programs that classically are being upregulated in scRNA-seq data:

  • Cell cycle, when cells go through division (eg. MCM5, TOP2A)

  • IFN response, in response to IFN stimulation (eg. MX1, ISG15)

  • HSP response, in response to stress (eg. HSPA1, HSPA2)

  • Activation response, in response to an activation signal, eg TCR engagement (eg. JUN, FOS)

  • Tissue-residency program, when cells infiltrate and home within tissues (eg. ZNF683, ITGAE)

Note: while some programs are transient (like the cell cycle), some programs might be more permanent / epigenetically fixed. The distinction between transient and permanent might be harder for some programs, like tissue-residency. We argue here that having a well-curated reference, free from many transient gene programs, will help in distinguishing transient from more stable cell states.

Here we will see how we use reference-based projection to uncover the original cell types from Cycling, IFN-activated and Tissue-resident cells using data from Gueguen et al. 2021.

How to identify gene programs?

Coregulated gene sets can also be found using previously defined literature but also using unsupervised techniques. Gene modules can be found using PCA, ICA, cNMF, scCoGAPS, or find_gene_modules from Monocle3, which runs UMAP on the genes and then groups them into modules using Louvain clustering. Here, for simplicity, we chose to focus on literature base markers, which are easily explainable, explicit programs.

Data setup

Loading the CD8 reference

First, let’s load the CD8 TIL reference by downloading it from Figshare.

# Load the reference
options(timeout = max(900, getOption("timeout")))
#download.file("https://figshare.com/ndownloader/files/38921366", destfile = "CD8T_human_ref_v1.rds")
ref.cd8 <- load.reference.map("CD8T_human_ref_v1.rds")
## [1] "Loading Custom Reference Atlas..."
## [1] "Loaded Custom Reference map Human CD8 TILs"
# Setup colors
mycols <- ref.cd8@misc$atlas.palette

# Plotting
DimPlot(ref.cd8,  group.by = 'functional.cluster', label = T, repel = T, cols = mycols) + theme(aspect.ratio = 1)

We see that the reference is composed only of what should only be stable cell types. Indeed, cells expressing highly transient gene programs (Cycling and IFN) were removed when constructing the reference.

Load Gueguen et al. data

#download.file("https://figshare.com/ndownloader/files/39082049", destfile = "gueguen.cd3.Rds")
gueguen.cd3 <- readRDS("gueguen.cd3.Rds")
gueguen.cd3$seurat_clusters <- Idents(gueguen.cd3)

# DimPlot
DimPlot(gueguen.cd3, order = T,  label = T, repel = T) 

Projection using reference map

The first step is to project using ProjecTILs ProjecTILs.classifier function. This mode will give us the labels while keeping the original UMAP embeddings.

# Projection
DefaultAssay(gueguen.cd3) <- "RNA"
gueguen.cd3 <- ProjecTILs.classifier(gueguen.cd3, ref = ref.cd8, filter.cells = T, split.by = 'orig.ident', ncores = 6)
table(gueguen.cd3$functional.cluster)
## 
##        CD8.CM        CD8.EM      CD8.MAIT CD8.NaiveLike     CD8.TEMRA 
##          2132          3181           147           238           276 
##       CD8.TEX      CD8.TPEX 
##          3340           337
# Plotting
DimPlot(gueguen.cd3, group.by = 'functional.cluster', order = T, cols = mycols, label = T, repel = T)

Augmenting cell type classifications with gene program scores

To further augment granularity, here we divide the dataset by computing signatures scores using UCell. To this end, we use our collection of useful transcriptional gene signatures: SignatuR, as well as manually curated signatures.

Here, we will use signatures generated for:

  • G1S cell cycle

  • G2M cell cycle

  • IFN response

  • Tissue residency

library(SignatuR)
signatures <- GetSignature(SignatuR$Hs)
signatures <- signatures[c("cellCycle.G1S","cellCycle.G2M")]

# Using manual signatures for IFN and resident, which work better with UCell than the full signatures from SignatuR package
signatures[["IFN"]] <- c("ISG15","IFI6","IFI44L","MX1")
signatures[["Tissue_Resident"]] <-  c("ITGAE")

Now let’s add the scores using UCell.

gueguen.cd3 <- AddModuleScore_UCell(gueguen.cd3, features = signatures, ncores = 8,
                       assay = "RNA")
FeaturePlot(gueguen.cd3, features = paste0(names(signatures), "_UCell"),
    order = T, pt.size = 0.3) & scale_color_paletteer_c("pals::coolwarm") 
## Scale for colour is already present.
## Adding another scale for colour, which will replace the existing scale.
## Scale for colour is already present.
## Adding another scale for colour, which will replace the existing scale.
## Scale for colour is already present.
## Adding another scale for colour, which will replace the existing scale.
## Scale for colour is already present.
## Adding another scale for colour, which will replace the existing scale.

We see that these signatures correlate with their annotation of the original paper.

Now, let’s check signatures scores distribution to decide which threshold to put to distinguish negative/positive cells.

Signatures scores distributions

qplot(gueguen.cd3$cellCycle.G1S_UCell, main = "G1S-Cycling") +
    qplot(gueguen.cd3$cellCycle.G2M_UCell, main = "G2M-Cycling") + qplot(gueguen.cd3$IFN_UCell, main = "IFN") + qplot(gueguen.cd3$Tissue_Resident_UCell, main = "Tissue Resident") & theme_bw()

# subset resident high cells in the CD8 reference
resident.high <- subset(gueguen.cd3, subset = Tissue_Resident_UCell > 0.2)
resident.low <- subset(gueguen.cd3, subset = Tissue_Resident_UCell <= 0.2)

# Set list
resident.list <- list(resident.low,resident.high)
names(resident.list) <- c("resident.low","resident.high")
# Radars to check differences between resident / non resident
plot.states.radar(ref = ref.cd8, query = resident.list, labels.col = 'functional.cluster', min.cells = 5, genes4radar = c("TCF7","CCR7","KLF2","GZMK","LMNA","FCGR3A","FGFBP2","XCL1","KLRB1", 'ZNF683',"ITGAE", 'ITGA1')) 

Resident-high and resident-low cells display the same overall phenotypes, but differ in their expression of key resident genes, upregulated when cells home within tissue, such as integrins (ITGAE, ITGA1) or the transcription factor HOBIT (ZNF683).

Indeed, the radars confirm that the annotation seems of good quality. Radar plots are a good way to show how phenotypes match between the reference and the query, but they differ by the resident genes (ZNF683, ITGA1, ITGAE and CXCR6). We also confirm that the cluster CD8-XCL1, originally identified as an early-precursors, indeed matches well the phenotype of Central Memory cells, while still being high for tissue-residency signal.

By looking at the UCell scores distribution, we categorised the cells by putting a UCell score threshold of 0.2.

wrap_plots(pll, guides = "collect")

As previously shown, CD8.TEX and CD8.TPEX are the most cycling, resident and IFN-responding subsets. For the IFN signal, results are inconsistent between UCell signature scores and projected clustes. Results will vary depending on the signatures chosen, the UCell threshold chosen, as well as the clustering resolution.

Upset plot to check program co-occurences

Let’s focus on the CD8.TEX population, and see how the different gene programs relate using a UpSet plot.

gueguen.cd3.TEX <- subset(gueguen.cd3, subset = functional.cluster == "CD8.TEX")
listInput <- list(IFN = which(gueguen.cd3.TEX$IFN.class == "Positive"), 
                  Cycling.G2M = which(gueguen.cd3.TEX$cellCycle.G2M.class == "Positive"), 
                  Cycling.G1S = which(gueguen.cd3.TEX$cellCycle.G1S.class == "Positive"),
                  Tissue_resident = which(gueguen.cd3.TEX$Tissue_Resident.class == "Positive"))


upset(fromList(listInput), order.by = "freq") 
grid.text("CD8.TEX gene programs overlap",x = 0.65, y=0.95, gp=gpar(fontsize=12))

We can see that in CD8.TEX, broadly half of the IFN high cells are also high for the tissue-residency program. This should be explored further.

Comparison with the original annotation

In the original study, the authors use Seurat TransferData function to transfer labels from the non-cycling clusters to the cycling clusters (‘CD4/8-TOP2A’, ‘CD4/8-MCM5’). With label transfer, Cycling cells are defined as CD8.LAYN and CD8.GZMH.

Now let’s compare this result with our projection results. First, let’s select the 2 cycling clusters.

gueguen.cycling <- subset(gueguen.cd3, subset = seurat_clusters %in% c('CD4/8-TOP2A', "CD4/8-MCM5"))

# Dimplot
DimPlot(gueguen.cycling, group.by = 'functional.cluster', label = T, repel = T, cols = mycols)
## Warning: ggrepel: 1 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

Clusters proportions in cycling

plot.statepred.composition(ref.cd8, gueguen.cycling, metric = "Percent") +
    ggtitle("Cycling cells") + ylim(0, 75) + theme_bw() + RotatedAxis()

Let’s check radar plots to confirm identities, where we included some markers genes for cell cycle: TOP2A, MCM5 and MCM3.

plot.states.radar(ref = ref.cd8, query = gueguen.cycling, labels.col = 'functional.cluster', min.cells = 5, genes4radar = c('IL7R','LEF1', "TCF7","CCR7","KLF2","GZMK","FCGR3A","FGFBP2","XCL1","KLRB1",'TOP2A','MCM5','MCM3'))

The radar plots match well between the reference and the query. It seems that we have indeed less differentiated cells in the cycling compartment (Central Memory and Effector Memory). Projection seems more robust than Seurat label transfer to uncover cell types hidden behind transient gene programs.

Session Info

sessionInfo()
## R version 4.2.1 (2022-06-23)
## Platform: aarch64-apple-darwin20 (64-bit)
## Running under: macOS Ventura 13.1
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] grid      stats     graphics  grDevices datasets  utils     methods  
## [8] base     
## 
## other attached packages:
##  [1] SignatuR_0.1.0      UpSetR_1.4.0        cowplot_1.1.1      
##  [4] plotly_4.10.1       pheatmap_1.0.12     UCell_2.2.0        
##  [7] paletteer_1.5.0     scales_1.2.1        ggrepel_0.9.2      
## [10] gridExtra_2.3       ProjecTILs_3.0.2    patchwork_1.1.2    
## [13] GEOquery_2.66.0     Biobase_2.58.0      BiocGenerics_0.44.0
## [16] data.table_1.14.6   STACAS_2.0.0        scGate_1.4.1       
## [19] forcats_0.5.2       stringr_1.5.0       dplyr_1.0.10       
## [22] purrr_1.0.1         readr_2.1.3         tidyr_1.2.1        
## [25] tibble_3.1.8        ggplot2_3.4.0       tidyverse_1.3.2    
## [28] SeuratObject_4.1.3  Seurat_4.3.0       
## 
## loaded via a namespace (and not attached):
##   [1] utf8_1.2.2                  spatstat.explore_3.0-5     
##   [3] reticulate_1.27             tidyselect_1.2.0           
##   [5] htmlwidgets_1.6.1           BiocParallel_1.32.5        
##   [7] Rtsne_0.16                  munsell_0.5.0              
##   [9] codetools_0.2-18            ica_1.0-3                  
##  [11] umap_0.2.9.0                future_1.30.0              
##  [13] miniUI_0.1.1.1              withr_2.5.0                
##  [15] spatstat.random_3.1-2       colorspace_2.1-0           
##  [17] progressr_0.13.0            highr_0.10                 
##  [19] knitr_1.41                  rstudioapi_0.14            
##  [21] stats4_4.2.1                SingleCellExperiment_1.20.0
##  [23] ROCR_1.0-11                 tensor_1.5                 
##  [25] listenv_0.9.0               labeling_0.4.2             
##  [27] MatrixGenerics_1.10.0       GenomeInfoDbData_1.2.9     
##  [29] polyclip_1.10-4             farver_2.1.1               
##  [31] parallelly_1.34.0           vctrs_0.5.2                
##  [33] generics_0.1.3              xfun_0.36                  
##  [35] timechange_0.2.0            R6_2.5.1                   
##  [37] GenomeInfoDb_1.34.7         rmdformats_1.0.4           
##  [39] pals_1.7                    bitops_1.0-7               
##  [41] spatstat.utils_3.0-1        cachem_1.0.6               
##  [43] DelayedArray_0.24.0         assertthat_0.2.1           
##  [45] promises_1.2.0.1            googlesheets4_1.0.1        
##  [47] gtable_0.3.1                globals_0.16.2             
##  [49] goftest_1.2-3               rlang_1.0.6                
##  [51] splines_4.2.1               lazyeval_0.2.2             
##  [53] gargle_1.2.1                dichromat_2.0-0.1          
##  [55] prismatic_1.1.1             spatstat.geom_3.0-5        
##  [57] broom_1.0.2                 BiocManager_1.30.19        
##  [59] yaml_2.3.7                  reshape2_1.4.4             
##  [61] abind_1.4-5                 modelr_0.1.10              
##  [63] backports_1.4.1             httpuv_1.6.8               
##  [65] tools_4.2.1                 bookdown_0.32              
##  [67] ellipsis_0.3.2              jquerylib_0.1.4            
##  [69] RColorBrewer_1.1-3          ggridges_0.5.4             
##  [71] Rcpp_1.0.10                 plyr_1.8.8                 
##  [73] zlibbioc_1.44.0             RCurl_1.98-1.9             
##  [75] openssl_2.0.5               deldir_1.0-6               
##  [77] pbapply_1.7-0               S4Vectors_0.36.1           
##  [79] zoo_1.8-11                  SummarizedExperiment_1.28.0
##  [81] haven_2.5.1                 cluster_2.1.4              
##  [83] fs_1.6.0                    magrittr_2.0.3             
##  [85] RSpectra_0.16-1             scattermore_0.8            
##  [87] lmtest_0.9-40               reprex_2.0.2               
##  [89] RANN_2.6.1                  googledrive_2.0.0          
##  [91] fitdistrplus_1.1-8          matrixStats_0.63.0         
##  [93] hms_1.1.2                   mime_0.12                  
##  [95] evaluate_0.20               xtable_1.8-4               
##  [97] readxl_1.4.1                IRanges_2.32.0             
##  [99] compiler_4.2.1              maps_3.4.1                 
## [101] KernSmooth_2.23-20          crayon_1.5.2               
## [103] htmltools_0.5.4             later_1.3.0                
## [105] tzdb_0.3.0                  lubridate_1.9.0            
## [107] DBI_1.1.3                   dbplyr_2.3.0               
## [109] MASS_7.3-58.2               data.tree_1.0.0            
## [111] Matrix_1.5-3                cli_3.6.0                  
## [113] parallel_4.2.1              igraph_1.3.5               
## [115] GenomicRanges_1.50.2        pkgconfig_2.0.3            
## [117] sp_1.6-0                    spatstat.sparse_3.0-0      
## [119] xml2_1.3.3                  bslib_0.4.2                
## [121] XVector_0.38.0              rvest_1.0.3                
## [123] digest_0.6.31               pracma_2.4.2               
## [125] sctransform_0.3.5           RcppAnnoy_0.0.20           
## [127] spatstat.data_3.0-0         rmarkdown_2.20             
## [129] cellranger_1.1.0            leiden_0.4.3               
## [131] uwot_0.1.14                 shiny_1.7.4                
## [133] lifecycle_1.0.3             nlme_3.1-161               
## [135] jsonlite_1.8.4              BiocNeighbors_1.16.0       
## [137] mapproj_1.2.9               askpass_1.1                
## [139] viridisLite_0.4.1           limma_3.54.0               
## [141] fansi_1.0.4                 pillar_1.8.1               
## [143] lattice_0.20-45             fastmap_1.1.0              
## [145] httr_1.4.4                  survival_3.5-0             
## [147] glue_1.6.2                  png_0.1-8                  
## [149] stringi_1.7.12              sass_0.4.5                 
## [151] rematch2_2.1.2              renv_0.15.5                
## [153] irlba_2.3.5.1               future.apply_1.10.0